How to build a payments system that scales to infinity (with examples)

Everybody, from SaaS applications to retailers, has to deal with payment processing. But architecting a system that can deal with payments at scale is challenging. That’s because payments system requirements can be prohibitively complex:

Payments often only matter for the 60-90 seconds during which a payment transaction takes place. However, because of the possibility of things like returns, refunds, disputes, and audits, payments data typically needs to remain available for years.
Successful businesses generate a lot of payments data as they scale. Storing all of this data and keeping it available without bogging down your application can be challenging.
Consistency and correctness are critical. Customers (and auditors) are not forgiving when it comes to money.
Availability is critical, since even a very small amount of downtime could equate to a lot of lost revenue at scale.

busy-time-payments-reference-architecture

Imagine a payments system outage during halftime. Just a few minutes of downtime can easily mean a major loss of revenue.

Let’s take a look at how payments work, and how to build systems that can meet all of the above requirements. If you’d prefer to learn watch a video about payments systems architecture you can check out this interview with 16 year FinTech engineering veteran Alex Lunev:

How payments systems work in the big picture

First, it’s worth taking a look at how payments actually work.

Whether a payment is coming from an online shop, a brick-and-mortar retailer, a SaaS app, etc. the process is generally pretty similar: a payment authorization request is sent from the terminal (i.e., the application, cash register, etc.) to a payment provider, and from there the request follows a chain back to the customer’s original card issuer. The card issuer returns an authorized or declined response, which is sent back to the terminal via the same path. Along the way, the payment provider will typically also store details related to the authorization request and response in a database and a data warehouse via something like Apache Kafka.

Assuming the payment is authorized, the terminal then initiates payment capture. The captured amount may be different from the amount in the initial payment request in some cases (due to tips, etc.). The capture transaction follows the same path as the authorization request, although this is often not instantaneous – for efficiency, these transactions are often cached and then processed in bulk (for example, once per hour). Both the originating terminal system (i.e. the SaaS application, online shop, etc.) and the payment provider will also generally store the payment capture data in a transactional database.

Scale: the biggest challenge for a payments database

The actual process of storing payment data in a database isn’t complex or particularly different from storing other types of data. At smaller scales, the consistency and correctness requirements for payments are easy enough to meet with the ACID transactional guarantees offered by most SQL databases.

However, as a company scales up, the regular transactional load placed on the database, as well as the ever-increasing total volume of data, can quickly become a performance bottleneck. A single Postgres or MySQL instance, for example, should perform very well for an application that’s only getting a few transactions per minute. But even a moderately successful business can quickly get to a point of dealing with hundreds of payments per minute on average, with much higher spikes during peak periods.

Banking resilience at global scale with Distributed SQL

Maintaining performance as your business grows requires finding a solution that allows you to scale up your payments database without sacrificing correctness or availability.

Approach 1: Manual sharding

The traditional approach to scaling a relational database like (for example) MySQL is manual sharding. This makes it possible to scale horizontally – you simply break your payments table into multiple shards running on multiple machines, and configure your application logic to route payment requests to the relevant shard or shards in a way that spreads the load so that no single machine gets overworked and becomes a bottleneck. Over time, as the company scales, you simply add more machines and more shards.

This approach certainly works, and it can even present an interesting engineering challenge at first. However, sharding requires either a lot of direct manual work or a really sophisticated automation/ops setup (which itself requires a lot of manual work). Each time you need to scale the database up, that work becomes more complex, as you’re increasing the number of separate shards your application has to deal with.

In the diagram above, the company has had to shard on four separate occasions, each time doubling the complexity of their setup. And that’s just the beginning; a highly successful company might need 64 or more separate shards to be able to maintain performance. What may start as a fun technical challenge quickly turns into something frustratingly Sisyphean. As long as your company grows, you can’t solve the problem of manual sharding. You are doomed to return to it, again and again, each time more scale is needed.

Additionally, scaling geographically may ultimately require that specific data, such as customer data, be stored in specific locations for performance and/or regulatory reasons. These requirements add an additional layer of complexity to the sharding problem, and may require re-sharding to ensure that all data is stored in shards that are physically located in the right places.

Approach 2: automatic scale and future-proofing with distributed SQL

Thankfully, that’s not the only approach! Modern distributed SQL databases can offer the same ACID guarantees as traditional relational DBMS together with the easy horizontal scaling of NoSQL.

For example, here’s how that same company’s scale-up process might look using CockroachDB:

Note that in the diagram above, no manual work is required. Depending on the method of consumption, CockroachDB either scales entirely automatically (as in CockroachDB Serverless) or by simply adding nodes (CockroachDB Dedicated), a process that takes less than a minute.

More importantly, regardless of the number of nodes your CockroachDB cluster is running on, nothing about your application logic has to change. Unlike with sharding, where you’re creating multiple shards of the same table and have to write flows to ensure requests go to the right shard, CockroachDB can always be treated as a single logical database.

This feature also makes scaling geographically much simpler. CockroachDB allows table- and row-level data homing, making it easy to assign rows to specific geographic regions. But developers don’t have to think about this complexity and can still treat CockroachDB the same way they’d treat a single Postgres instance. The database handles the geographical distribution and data replication – all of that manual sharding work – automatically.

This becomes more visually apparent when we look at things at the table level in a sharded MySQL database when compared to CockroachDB:

With a sharded payments table, everything from writing your application logic to querying data becomes challenging because all requests have to be routed to the correct physical database and shard. In CockroachDB, by contrast, you’re dealing with a single payments table regardless of your scale or node count, and sending data from the application or running queries against this table works exactly the same whether you’re dealing with ten transactions a day or ten million.

It also works the same whether you’ve got a single-region architecture or a multi-region setup with data homing so that data is always stored closest to where users are likely to access it (and/or where local regulations require it to be stored).

CockroachDB for payment processing

Zooming back out, let’s take a look at where CockroachDB fits into the broader context of processing payments.

Formed on a multi-cloud platform: Scalable, resilient payment technology

CockroachDB in a SaaS application architecture

For any “terminal” application such as a SaaS app or an online shop, CockroachDB serves as the primary transactional database, storing payment authorization request data and capture data when it’s returned from the payment processor. If desired, this data can also be easily synced to Apache Kafka or similar services via CockroachDB’s built-in change data capture (CDC) feature.

CockroachDB in payment provider architecture

Payment providers themselves also need to store their payments data, and CockroachDB thus occupies a similar position as the primary transactional database in a payment service’s architecture.

If you’re interested in learning more about how CockroachDB fits in FinTech you can check out our dedicated Finserv page and watch this short video about another use finserv use case from our customer: Spreedly